Is there statistical evidence of fewer births on Feb 29 in the US?

An evaluation using 2000-2014 data

Leap day aversion? A 2012 calandar of daily births.

Is there statistical evidence that Feb 29 is an un-favored day to give birth in the US?

The 2012 calendar below, showing number of daily birth as represented by color and bubble size, shows quite a lot of variation in the number of births from one day to another. For example December 25th’s recorded number of births is small compared with its neighbors, and births on the weekends appear to be much lower than weekday counterparts.

With February 29th (leap day) coming up this week, you might wonder, are there also relatively fewer births on February 29th, circled on the calendar. Feb 29th is a weird birthday to have – only truly ‘celebrated’ every four years. But it’s hard to draw any conclusions about a dip in births being more than normal variation in daily births from this cursory look and from a single year of data. In what follows, we’ll assess the question, ‘Is there statistical evidence of preference against February 29th birthing based on number of daily births in the U.S. over a 15 year period, 2000 to 2014?’

I use ggcalendar functions to build this plot in a concise way.

library(ggcalendar)
births_df %>%
  filter(year == 2012) %>% 
  ggcalendar() + 
  aes(date = date) + 
  geom_point_calendar() +
  aes(size = births) +
  aes(color = births) +
  geom_text_calendar(aes(label = day(date)), 
                     color = "oldlace", 
                     size = 2) + 
  guides(
    colour = guide_legend("Births"),
    size = guide_legend("Births")
 ) + 
  geom_point_calendar(data = data.frame(date =
                                      as_date("2012-02-29")),
                      size = 7, color = "red", shape = 21, stroke = 2) + 
  scale_color_viridis_c() + 
  labs(title = "The year in 2012 in births")

Zooming in on the week of February 29, 2012 in the figure below, it’s easier to appreciate the dip in number of births that is observed on February 29th. But the question remains, Is this variability usual, or is there an aversion to birthing on February 29th that is statistically detectable looking at more leap days?

births_df |>
  filter(date >= as.Date("2012-02-26") &
           date <= as.Date("2012-03-03")) |>
  ggplot() + 
  aes(x = date, y = births) +
  geom_point() + 
  geom_line(linetype = "dashed") + 
  geom_label(aes(label = day_of_week)) + 
  geom_point(data = . %>% filter(date == as.Date("2012-02-29")),
             color = "red", shape = 1, 
             size = 20, stroke = 2, alpha = .8) +
  labs(title = "Births the week of Feb 29, 2012") 

I begin the analysis with a visual format that’s familiar to the audience, and where we can be pretty sure of the effects of relevant variables just by inspection

2000-2014 U.S. Births Data

The 15 years of data we’ll look at was available via the #TidyTuesday project – their October 2, 2018 dataset. Some data cleaning is required including constructing a full date variable from year, month and date_of_month variables. I also normalize the dates (to 2020) for superimposed within-year comparisons - importantly 2020 is a leap year so Feb 29th dates are valid and not dropped. I further include indicator variables that I anticipate to be important as well as an indicator variable that captures the condition of interest, ind_Feb_29th.

library(tidyverse)

births_path <- "https://raw.githubusercontent.com/EvaMaeRey/tableau/9e91c2b5ee803bfef10d35646cf4ce6675b92b55/tidytuesday_data/2018-10-02-us_births_2000-2014.csv"

readr::read_csv(births_path) %>% 
  filter(year != 2015) %>%
  mutate(month = str_pad(month, 2, pad = "0"),
         date_of_month = str_pad(date_of_month, 2, pad = "0")) %>% 
  mutate(date = paste(year, month, date_of_month, sep = "-") %>% as_date()) %>% 
  mutate(day_of_week = wday(date, label = T, week_start = "Monday") %>% factor(ordered = FALSE)) %>%
  mutate(ind_holiday = 
           (month == "12" & date_of_month %in% 24:31) |
           (month == "07" & date_of_month %in% c("04", "05")) |
           (month == "01" & date_of_month %in% c("01", "02")) | 
           (month == "10" & date_of_month == "31") | 
           (month == "11" & date_of_month %in% 20:30) |
           (month == "02" & date_of_month %in% 14)
           ) |>
  mutate(date_as_if_2020 = paste(2020, month, date_of_month, sep = "-") %>% as_date()) |>
  mutate(ind_weekend = wday(date) == 1 | wday(date) == 7) |>
  mutate(ind_Feb_29th = month(date) == 2 & day(date) == 29) |>
  mutate(ind_13th = day(date) == 13) |>
  mutate(ind_Fri13th = wday(date) == 6 & day(date) == 13) ->
births_df

Visual exploration

Baseline visualization of the time series

Visualizing the U.S. births time series, we see that there is tremendous variability in number of birth per day. The standard deviation for daily births is 2326 in this time span, with the average number of births around 11400, as marked in the visualization below.

I use ggxmean functions to show the mean value for the number of daily births across the time span.

births_df |> 
  ggplot() + 
  aes(x = date, y = births) + 
  geom_point(size = .2) ->
time_series_base


add_y_mean <- function(){
  list(ggxmean::geom_y_mean(linetype = "dotted"),
  ggxmean::geom_y_mean_label() )
}

time_series_base + 
  add_y_mean()



round(sd(births_df$births))
#> [1] 2326

Univariate distribution

Looking at the time series above or the univariate distribution of daily births below, the bi-modal pattern is striking.

I use function geom_x_mean() from ggxmean to quickly show and annotate the mean value.

ggplot(births_df) + 
  aes(x = births) + 
  geom_histogram() + 
  ggxmean::geom_x_mean() +
  ggxmean::geom_x_mean_label(size = 3) + 
  geom_rug(alpha = .2) + 
  labs(x = "Number of births") + 
  labs(title = "Distribution of Number of Births in the U.S. each day from 2000-2014" %>% str_wrap(65)) ->
univariate_base; univariate_base

Exploring bi-modality with weekend indicator

Breaking this data up, we explore the hypothesis that preference for scheduled birth (inducements or c-sections) is for weekdays. In the figure below see that most of the bimodality is explained by weekend v. weekday effects. The difference in means for these groups is substantial - around 4650 difference in number of births!

I use the ind_recode function from the ind2cat package which automatically recodes the indicator variable to meaningful categories. If the T/F indicator variable is used directly, in this plot facet labels would be TRUE and FALSE, and would be hard to interpret without looking a the plot source code.

univariate_base +
  facet_wrap(~ind2cat::ind_recode(ind_weekend), ncol = 1) 

Day of weeks effects

When we return to our time-series plot, but breaking the plots up by weekend, you may notice further bimodality within the ‘weekend’ subplot suggesting differences within the weekend for Saturday and Sunday.

time_series_base + 
  facet_wrap(~ind2cat::ind_recode(ind_weekend), ncol = 1) + 
  labs(title = "We explore the bimodal distribution looking at 'weekend effects'" %>% str_wrap(65))

Following on that observation, we investigate by-weekday (Sunday, Monday, Tuesday etc) differences in number of births below. We look first at the histogram and then the time series plot.

For each day we plot the 95% confidence interval around the mean based on t-tests; these are visualized as a red horizontal segment at the base of the histogram. The intervals are calculated independently by panel.

univariate_base +
  facet_wrap(~wday(date, label = T, week_start = "Monday"), ncol = 1)  + 
  ggxmean:::geom_tdist(color = "red", linewidth = 2)

We return to the time series display to look at the joint distribution with the time dimension.

time_series_base + 
  facet_wrap(~day_of_week, ncol = 2) + 
  labs(title = "Number of births 2000-2014 by day of week" %>% str_wrap(45))

Holidays and outlying observations

The above by-day-of-the-week exploration (and weekend vs. weekday) further exposes some outlying observations. We next explore if these lower-than-usual their counterparts are by and large holiday or holiday adjacent days. Due to time constraints, I’m relying a quickly coded indicator variable that I created that include many, but not all holiday and adjacent dates.

last_plot() + 
  aes(color = ind2cat::ind_recode(ind_holiday)) +
  labs(color = NULL) + 
  theme(legend.position = 'top', legend.justification = "left") + 
  scale_color_manual(values = c("grey75", "red"))

This exploration suggests that holidays drive much of the out-of-character observations.

Outlyingness might also be due to aversion to birth days due to superstition - for example the 13th of each month, especially Fridays the 13th, and Halloween might be more outlying though not official holidays.

Also, holiday adjacency might also lead to lower number of births - for example federal holidays that fall on the weekend are often observed on a Monday. The outliers that we observe on Monday are not categorized as holiday, but this likely due to the imperfect ind_holiday coding. President’s day was not captured in my coding, for example and always falls on a Monday.

Data pruning

Because February 29 is not a holiday or holiday-adjacent, it is appropriate to prune our data to relevant comparisons and try to remove likely holidays; this should lead to more precise estimates in our final analysis.

Due to time constraints and convenience, our pruning will be based on outlyingness instead of relying on purely substantive knowledge (i.e. using lists of federal and celebrated holidays, etc).

We’ll prune observations for which a linear model’s error is greater than 3 standard deviations from the mean residual error (which will be zero).

The linear model contain date and day of the week as categories that explain number of daily births; the date variable is fourth order with the first order term interacted with the day of the week.

To start, I visualize the ‘parallel lines’ model fit.

This visualization uses ggplot2 Stat extension via the new and experimental ggtemp package which allows creating new geom_ and stat_ functions with less ‘boiler plate’ code. A model of a more syntactically conventional construction of this layer function is featured in the ggplot2 extension cookbook.

compute_panel_lm_parallel <- function(data, scales){
  
  model <- lm(y ~ x*category + I(x^2) + I(x^3)+ I(x^4), data = data)
  
  data |>
    mutate(y = model$fitted)
  
}

ggtemp:::create_layer_temp(fun_name = "geom_ols_linear_parallel",
                           required_aes = c("x", "y", "category"),
                           compute_panel = compute_panel_lm_parallel,
                           default_aes = aes(color = after_stat(category)),
                           geom_default = GeomLine)

ggplot(births_df) +
  aes(x = date, y = births,
      color = day_of_week, category = day_of_week) +
  geom_point(alpha = .15) + 
  geom_ols_linear_parallel()

Now, we estimate model outside of the visualization tool.

births_df$x <- as.numeric(births_df$date)

m <- lm(births ~ x*day_of_week + I(x^2) + I(x^3)+ I(x^4), data = births_df)
births_df$residuals_wday <- m$residuals

Then we visualize the distribution of the model residuals and mark the threshold of 3 standard deviations from the mean, beyond which we’ll prune observations.

For convenience, we again use some functions from ggxmean

ggplot(births_df) +
  aes(x = residuals_wday) +
  geom_histogram() + 
  ggxmean::geom_x_mean() + 
  ggxmean:::geom_x3sd(linetype = "dashed")


sd(births_df$residuals_wday)
#> [1] 843.1835

The code below prunes the data and shows a basic time series plot with the new

births_df$residuals_large <- abs(births_df$residuals_wday) > 
           3*sd(births_df$residuals_wday)

births_df_pruned <- births_df |>
  filter(!residuals_large)

time_series_base %+% 
  births_df + # updating data which now has additional varialbes
  facet_wrap(~day_of_week, ncol = 2) + 
  labs(title = "Outlier pruning visualization" %>% str_wrap(50)) + 
  aes(color = residuals_large) + 
  scale_color_manual(values = c("grey", "magenta2"), 
                     labels = c("kept", "dropped")) + 
  theme(legend.position = "top", legend.justification = "left") +
  labs(color = NULL)

  

num_dropped <- dim(births_df)[1] - dim(births_df_pruned)[1]

Dropping the large residuals results in 115 observations dropped from our dataset.

In the remainder of the exploration and analysis, we’ll use the pruned version of the data, unless otherwise stated.

Multiple linear regression modeling

Given the relationships uncovered in the exploratory visualizations, factors that should be included in any model of the number of daily births in the US should include seasonality, long term trends and day-of-week effects. We also model from our pruned data, dropping likely holidays as we are not interested in explaining the variation related to holidays given the 29th’s non-holiday status.

We account for long term trends by using indicator variables for each year; for seasonality using month indicator variables for each month; and for day-of-week effects using an indicator variable for each day of the week. The variable of interest, ‘being February 29th’ is also an indicator variable.

births_base_model <- lm(formula = births ~
                          ind_Feb_29th +
                          day_of_week +
                          month + 
                          factor(year),
                        data = births_df_pruned) 

Based on model diagnostic plots, the inclusion of some observations do deserve additional review.

plot(births_base_model)

stargazer::stargazer(births_base_model, type = "html", 
                     omit = c("year", "month", "day_of_week"), 
                     omit.labels=c("year", "month", "day of week"), 
                     omit.yes.no = c("yes", "no"))
Dependent variable:
births
ind_Feb_29th -862.494***
(208.404)
Constant 11,708.120***
(31.989)
year yes
month yes
day of week yes
Observations 5,364
R2 0.968
Adjusted R2 0.967
Residual Std. Error 414.300 (df = 5331)
F Statistic 4,983.343*** (df = 32; 5331)
Note: p<0.1; p<0.05; p<0.01

Narrowing to calendar date peers

births_df |> 
     filter(date_as_if_2020 >= as.Date("2020-02-25")) |>
     filter(date_as_if_2020 <= as.Date("2020-03-4")) |>
  ggplot() + 
  aes(x = date_as_if_2020, y  = births) + 
  geom_vline(xintercept = as.Date("2020-02-29"), 
             linetype = "dotted",
             color = "magenta3",
             linewidth = 1
             ) + 
  geom_line( alpha = .2) +
  geom_label(aes(label = wday(date, label = T), fill = ind2cat::ind_recode(ind_Feb_29th)), 
            size = 4, 
            ) + 
  facet_wrap(~year) +
  aes(group = paste(year, ind_weekend, isoweek(date))) + 
  scale_size_discrete(range = c(4,6)) + 
  scale_fill_manual(values = c("white", "magenta1")) + 
  theme(legend.position = "top", legend.justification = "left")


last_plot() + 
  facet_null() + 
  aes(color = year)

births_model_feb29_peers <- lm(formula = 
                                 births ~ 
                                 ind_Feb_29th +
                                 day_of_week + 
                                 date_as_if_2020 + 
                                 factor(year), 
                               data = births_df_pruned |> 
     filter(date_as_if_2020 >= as.Date("2020-02-16")) |>
     filter(date_as_if_2020 <= as.Date("2020-03-9"))) 
stargazer::stargazer(births_base_model, births_model_feb29_peers, type = "html", 
                     omit = c("year", "month", "day_of_week"), 
                     omit.labels=c("year", "month", "day of week"), 
                     omit.yes.no = c("yes", "no"))
Dependent variable:
births
(1) (2)
ind_Feb_29th -862.494*** -867.752***
(208.404) (146.946)
date_as_if_2020 1.823
(2.326)
Constant 11,708.120*** -21,789.590
(31.989) (42,610.810)
year yes yes
month yes no
day of week yes yes
Observations 5,364 334
R2 0.968 0.984
Adjusted R2 0.967 0.982
Residual Std. Error 414.300 (df = 5331) 285.891 (df = 311)
F Statistic 4,983.343*** (df = 32; 5331) 843.955*** (df = 22; 311)
Note: p<0.1; p<0.05; p<0.01